    From a Conceptual Model to a Knowledge Graph for Genomic Datasets

    Data access at genomic repositories is problematic, as data is described by heterogeneous and hardly comparable metadata. We previously introduced a unified conceptual schema, collected metadata in a single repository and provided classical search methods upon them. We here propose a new paradigm to support semantic search of integrated genomic metadata, based on the Genomic Knowledge Graph, a semantic graph of genomic terms and concepts, which combines the original information provided by each source with curated terminological content from specialized ontologies. Commercial knowledge-assisted search is designed for transparently supporting keyword-based search without explaining inferences; in biology, inference understanding is instead critical. For this reason, we propose a graph-based visual search for data exploration; some expert users can navigate the semantic graph along the conceptual schema, enriched with simple forms of homonyms and term hierarchies, thus understanding the semantic reasoning behind query results

    Experiences in the development of a data management system for genomics

    GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available. In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads of the sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions. Coherently, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, by going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms

    Filming a live cell by scanning electrochemical microscopy: label-free imaging of the dynamic morphology in real time

    The morphology of a live cell reflects the organization of the cytoskeleton and the health status of the cell. We established a label-free platform for monitoring the changing morphology of live cells in real time based on scanning electrochemical microscopy (SECM). The dynamic morphology of a live human bladder cancer cell (T24) was revealed by time-lapse SECM with dissolved oxygen in the medium solution as the redox mediator. Detailed local movements of the cell membrane were captured by cross-section lines extracted from the time-lapse SECM images. The dynamic morphology is presented vividly as a movie assembled from the time-lapse SECM images. The morphological change of the T24 cell at non-physiological temperature is consistent with the morphological features of early apoptosis. Obtaining dynamic cellular morphology with other methods is difficult. The non-invasive nature of SECM, combined with its high resolution, made it possible to film the movements of live cells

    Towards a Definitive Measure of Repetitiveness

    Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma$ lower bounds all the previous relevant ones (including $z$), yet length-$n$ strings can be represented and efficiently indexed within space $O(\gamma \log(n/\gamma))$, which also upper bounds most measures (including $z$). While $\gamma$ is certainly a better measure of repetitiveness than $z$, it is NP-complete to compute, and no $o(\gamma \log n)$-space representation of strings is known. In this paper, we study a smaller measure, $\delta \le \gamma$, which can be computed in linear time. We show that $\delta$ better captures the compressibility of repetitive strings. For every length $n$ and every value $\delta \ge 2$, we construct a string such that $\gamma = \Omega(\delta \log(n/\delta))$. Still, we show a representation of any string $S$ in $O(\delta \log(n/\delta))$ space that supports direct access to any character $S[i]$ in time $O(\log(n/\delta))$ and finds the $occ$ occurrences of any pattern $P[1..m]$ in time $O(m \log n + occ \log^{\varepsilon} n)$ for any constant $\varepsilon > 0$. Further, we prove that no $o(\delta \log n)$-space representation exists: for every length $n$ and every value $2 \le \delta \le n^{1-\varepsilon}$, we exhibit a string family whose elements can only be encoded in $\Omega(\delta \log(n/\delta))$ space. We complete our characterization of $\delta$ by showing that, although $\gamma$, $z$, and other repetitiveness measures are always $O(\delta \log(n/\delta))$, for strings of any length $n$, the smallest context-free grammar can be of size $\Omega(\delta \log^2 n / \log\log n)$. No such separation is known for $\gamma$
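
The abstract above does not spell out how δ is defined. In the paper it is the maximum, over all substring lengths k, of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S. The following is a minimal sketch of that definition only: it is a naive enumeration, not the linear-time algorithm the abstract mentions, and the function name `delta` is chosen here for illustration.

```python
def delta(s: str) -> float:
    """Naive computation of delta(s) = max over k of d_k(s) / k.

    d_k(s) is the number of distinct substrings of s with length k.
    This enumerates every substring, so it is only suitable for short
    strings; it is NOT the linear-time method referred to in the abstract.
    """
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect all distinct substrings of length k.
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best


if __name__ == "__main__":
    for text in ["aaaa", "abab", "abcd"]:
        print(text, delta(text))
```

A highly repetitive string such as "aaaa" yields a small value, while a string with all-distinct characters such as "abcd" yields a value equal to its length, which matches the intuition that δ grows as the string becomes less repetitive.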

    On the power and the systematic biases of the detection of chromosomal inversions by paired-end genome sequencing

    One of the most widely used techniques to study structural variation at the genome level is paired-end mapping (PEM). PEM has the advantage of being able to detect balanced events, such as inversions and translocations. However, inversions are still quite difficult to predict reliably, especially from high-throughput sequencing data. We simulated realistic PEM experiments with different combinations of read and library fragment lengths, including sequencing errors and meaningful base-qualities, to quantify and track down the origin of false positives and negatives along sequencing, mapping, and downstream analysis. We show that PEM is very appropriate to detect a wide range of inversions, even with low coverage data. However, % of inversions located between segmental duplications are expected to go undetected by the most common sequencing strategies. In general, longer DNA libraries improve the detectability of inversions far better than increments of the coverage depth or the read length. Finally, we review the performance of three algorithms to detect inversions (SVDetect, GRIAL, and VariationHunter), identify common pitfalls, and reveal important differences in their breakpoint precisions. These results stress the importance of the sequencing strategy for the detection of structural variants, especially inversions, and offer guidelines for the design of future genome sequencing projects
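
As a rough illustration of the signal that PEM relies on for inversions: in a standard paired-end library the two reads of a pair normally map to opposite strands, so a pair whose reads map to the same strand on the same chromosome suggests that one read lies across an inversion breakpoint. The sketch below encodes only that rule. The `ReadPair` type, the `classify_pair` function, the label strings, and the `max_span` threshold are all hypothetical names introduced here; they are not taken from SVDetect, GRIAL, or VariationHunter, and real callers also cluster pairs and model mapping ambiguity.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two reads of one pair (hypothetical type)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair: ReadPair, max_span: int = 500) -> str:
    """Label a read pair by its mapping pattern.

    max_span is an assumed upper bound on the distance between the two
    mapped reads for a normal library fragment.
    """
    if pair.chrom1 != pair.chrom2:
        return "other_discordant"
    if pair.strand1 == pair.strand2:
        # Same-strand pairs are the classic inversion signature.
        return "inversion_candidate"
    if abs(pair.pos2 - pair.pos1) > max_span:
        return "other_discordant"
    return "concordant"


if __name__ == "__main__":
    normal = ReadPair("chr1", 1000, "+", "chr1", 1300, "-")
    inverted = ReadPair("chr1", 1000, "+", "chr1", 90000, "+")
    print(classify_pair(normal), classify_pair(inverted))
```

The dependence on `max_span` hints at why the abstract stresses library fragment length: the fragment length determines which pairs can span a breakpoint at all.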

    Ratio of the Isolated Photon Cross Sections at $\sqrt{s}$ = 630 and 1800 GeV

    The inclusive cross section for production of isolated photons has been measured in $p\bar{p}$ collisions at $\sqrt{s} = 630$ GeV with the DØ detector at the Fermilab Tevatron Collider. The photons span a transverse energy ($E_T$) range from 7 to 49 GeV and have pseudorapidity $|\eta| < 2.5$. This measurement is combined with the previous DØ result at $\sqrt{s} = 1800$ GeV to form a ratio of the cross sections. Comparison of next-to-leading order QCD with the measured cross section at 630 GeV and the ratio of cross sections shows satisfactory agreement in most of the $E_T$ range. Comment: 7 pages. Published in Phys. Rev. Lett. 87, 251805 (2001)

    Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals

    A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the “ortholog conjecture”). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act

    ART: A machine learning Automated Recommendation Tool for synthetic biology

    Biology has changed radically in the last two decades, transitioning from a descriptive science into a design science. Synthetic biology allows us to bioengineer cells to synthesize novel valuable molecules such as renewable biofuels or anticancer drugs. However, traditional synthetic biology approaches involve ad-hoc engineering practices, which lead to long development times. Here, we present the Automated Recommendation Tool (ART), a tool that leverages machine learning and probabilistic modeling techniques to guide synthetic biology in a systematic fashion, without the need for a full mechanistic understanding of the biological system. Using sampling-based optimization, ART provides a set of recommended strains to be built in the next engineering cycle, alongside probabilistic predictions of their production levels. We demonstrate the capabilities of ART on simulated data sets, as well as experimental data from real metabolic engineering projects producing renewable biofuels, hoppy flavored beer without hops, and fatty acids. Finally, we discuss the limitations of this approach, and the practical consequences of the underlying assumptions failing

    Ageing, adipose tissue, fatty acids and inflammation

    A common feature of ageing is the alteration in tissue distribution and composition, with a shift in fat away from lower body and subcutaneous depots to visceral and ectopic sites. Redistribution of adipose tissue towards an ectopic site can have dramatic effects on metabolic function. In skeletal muscle, increased ectopic adiposity is linked to insulin resistance through lipid mediators such as ceramide or DAG, which inhibit the insulin receptor signalling pathway. Additionally, the risk of developing cardiovascular disease is increased with elevated visceral adipose distribution. In ageing, adipose tissue becomes dysfunctional, with the pathway of differentiation of preadipocytes to mature adipocytes becoming impaired; this results in dysfunctional adipocytes less able to store fat and subsequent fat redistribution to ectopic sites. Low grade systemic inflammation is commonly observed in ageing, and may drive the adipose tissue dysfunction, as proinflammatory cytokines are capable of inhibiting adipocyte differentiation. Beyond increased ectopic adiposity, the effect of impaired adipose tissue function is an elevation in systemic free fatty acids (FFA), a common feature of many metabolic disorders. Saturated fatty acids can be regarded as the most detrimental of FFA, being capable of inducing insulin resistance and inflammation through lipid mediators such as ceramide, which can increase the risk of developing atherosclerosis. Elevated FFA, in particular saturated fatty acids, may be a driving factor for the increased insulin resistance, cardiovascular disease risk and inflammation in older adults